Any system with the ability to learn from a time series and predict the future must have
a memory representing the information from the recent past. In cases where the external
environment generating the time series has a fixed scale, the memory can be a simple shift
register: a moving window of finite width extending into the past. The width of the window
should be large enough to describe the largest scale relevant for predicting the signal.
However, such a traditional buffer is inappropriate if the longest relevant scale is not known
a priori, or if the signal has structure at many different time scales. It is well known that
signals with scale-free long range correlations are found in many physical environments. Hence
we argue in favor of a memory that is a scale-free fuzzy buffer which implicitly accounts for
scale-free fluctuations in naturally generated signals. Based on a neuro-cognitive model of
internal time, we construct a fuzzy buffer that optimally sacrifices the accuracy of information
representation in order to represent exponentially long time scales without an explosion
in capacity demands. Using several illustrative time series we demonstrate the advantage of
the fuzzy buffer over the shift register in time series forecasting. We suggest that this method
for representing time-varying signals may be of broad utility in a variety of applications.



I. INTRODUCTION 

Time series forecasting is a generic problem that arises in many contexts ranging from
understanding the occurrence of solar flares to understanding stock market fluctuations. A basic
question that arises with respect to forecasting is: how much of the recent past of the time
series is required to generate accurate predictions for the future? If there are no significant
correlations beyond a particular scale, then a shift register [1] of appropriate size should
suffice to accurately hold the information from the recent past leading up to any moment.
However, when the relevant statistics are unknown, it would be disadvantageous to subscribe to a
fixed size shift register. There are many instances where we might be concerned that the relevant
information needed to forecast the time series could be spread over very long timescales. For
example, complex interconnected dynamical systems often exhibit scale-free long range
correlations in their spatial and temporal fluctuations, commonly known as 1/f fluctuations.
Such long range correlations can be found in statistics of natural images [2-4], speech and
music [5], brain activity [6], economics [7] and even in human cognition [8-10]. Though the
physical generating mechanism underlying such long range correlations is an active subject of
debate among engineers and physicists [11], our interest here is to simply point out its
ubiquitous existence with a focus on the following question: If an intelligent learner is to
learn using the time series generated by real world complex systems which may have correlations
over very long delays, is there an optimal way to represent the time series in the memory
buffer of the learner?

Here we argue that it is advantageous for a buffer with finite resources to reflect the natural
scale-free temporal structure associated with the uncertainties of the world. If one were to a priori
assume that the time series is generated by a system with long range correlations, then an event
that happened 100 seconds ago does not have to be represented as accurately in time as an event
that happened 10 seconds ago. By sacrificing the accuracy in a scale-free fashion, the learner can
optimally gather the relevant statistics from the time series with a built-in assumption that the
series exhibits long range correlations. In this paper, we describe such a scale-free fuzzy buffer and
discuss its advantages over a shift register in extracting statistics and forecasting time series from
a generic external environment.

Of course, representing the recent history in an optimal fashion is not sufficient to successfully
predict the future time series. It is crucial to learn the relevant statistics with an efficient learning
algorithm. When the process generating the time series is unknown or highly complex, even
simple statistical learning methods such as correlation and spectral analysis can extract the
significant statistics underlying the time series. Though a variety of sophisticated machine learning
algorithms exist (see e.g., [12, 13]), there is none that is constructed to act on a fuzzy memory
buffer to extract the relevant statistics. The choice of the learning algorithm is modular to the
choice of the memory buffer. The focus of this paper is on the memory buffer and not on the learning
algorithm per se; in section 4 we use a simple linear regression algorithm to demonstrate the utility
of the fuzzy buffer in time series forecasting.

The layout of the paper is as follows. In section 2 we start with a mathematical motivation for
the capacity-accuracy tradeoff in the memory buffer based on some general properties of long-range
correlated time series. We explain the criteria for optimally sacrificing accuracy of information
representation in the memory buffer for the sake of capacity to represent longer time scales. In
section 3 we describe a specific method for representing the temporal history of the time series in a
scale-free way based on a neuro-cognitive model of internal time, TILT [13]. Mathematically, this
method is equivalent to encoding the Laplace transform of the time series and approximating its
inverse to reconstruct a fuzzy representation of the time series. We then construct the fuzzy buffer
by imposing the criteria of optimally sacrificing the accuracy of information representation, and
show that with limited resources the fuzzy buffer has the capacity to represent exponentially larger
timescales in comparison to a shift register. In section 4, we compare the performance of the fuzzy
buffer and the shift register in time series forecasting. Using an artificially generated long-range
correlated time series and empirically observed time series for sunspots and Earth's temperature,
we show that the fuzzy buffer consistently outperforms the shift register in time series forecasting.
Finally, we conclude by pointing out that adopting such a fuzzy buffer as a baseline memory
representation in statistical learning models could be very useful.

II. MOTIVATION FOR CAPACITY-ACCURACY TRADEOFF

Let us assume that the learner must learn and forecast a real valued time series that has a
two point correlation function that falls off like a power law. Naturally occurring time series will
most certainly contain more subtle features like higher order correlations, but they are currently
irrelevant for motivating the need for a fuzzy buffer. Hence for simplicity, let $v_\tau$ represent a
stationary time series indexed by time stamp $\tau$, which has zero mean, finite variance, and
power-law two point correlations, namely $\langle v_\tau v_{\tau'} \rangle \sim 1/|\tau - \tau'|^{\alpha}$
for large temporal differences $|\tau - \tau'|$. When $\alpha < 1$, the time series is said to possess
long range correlations [15]. Our aim here is to simply represent this time series in a memory
buffer so as to optimally extract its statistical properties and forecast the future values of the
time series. For this purpose, it is useful to view the time series from the perspective of it being
generated by a generic statistical algorithm, the ARFIMA model [10, 16, 17]. The basic idea behind
this algorithm is that white noise at each time step can be fractionally integrated to generate a
time series with long range correlations. It turns out that the time series can be viewed as
generated by an infinite auto-regressive generating function integrating white noise. Without loss
of generality, consider the current time step in the time series to be $\tau = 0$, and let the time
steps be uniformly spaced so that they can be simply labeled by integers. The value $v_0$ at the
current time step is a linear combination of white noise $\eta$ and the values $v_n$ from past
times $\tau = n$:

$$ v_0 = \eta_0 + \sum_{n=1}^{\infty} a(n)\, v_n . \qquad (1) $$

The ARFIMA model uniquely specifies the regression coefficients $a(n)$ in terms of the exponent $\alpha$:

$$ a(n) = \frac{(-1)^{n+1}\, \Gamma(d+1)}{\Gamma(n+1)\, \Gamma(d-n+1)} , \qquad (2) $$

where $d$ is the fractional integration power given by $d = (1 - \alpha)/2$. It is known that an ARFIMA
time series is stationary and long range correlated with finite variance only when $d \in (0, 1/2)$,
or equivalently $\alpha \in (0, 1)$ [16, 17]. The asymptotic behavior of $a(n)$ for large $n$ can be
obtained by applying Euler's reflection formula and Stirling's formula to approximate the Gamma
functions in eq. 2:



$$ a(n) \simeq \frac{\Gamma(d+1)\, \sin(\pi d)}{\pi}\; n^{-(1+d)} \quad \text{for } n \gg 1 . \qquad (3) $$
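The relationship between the exact coefficients of eq. 2 and the power-law form of eq. 3 can be checked numerically. The following is a minimal sketch, not part of the original analysis; the function names and the value d = 0.3 are chosen here purely for illustration. It uses the recursion a(1) = d, a(n+1) = a(n)(n - d)/(n + 1), which follows from eq. 2 and avoids evaluating very large Gamma functions.

```python
import math

def arfima_coefficients(d, n_max):
    """Return [a(1), ..., a(n_max)], the regression weights of eq. 2.

    Uses the recursion a(1) = d, a(n+1) = a(n) * (n - d) / (n + 1),
    which is equivalent to eq. 2 but does not overflow for large n.
    """
    coeffs = [d]
    for n in range(1, n_max):
        coeffs.append(coeffs[-1] * (n - d) / (n + 1))
    return coeffs

def arfima_closed_form(d, n):
    """Direct evaluation of eq. 2 (only practical for small n)."""
    return (-1) ** (n + 1) * math.gamma(d + 1) / (
        math.gamma(n + 1) * math.gamma(d - n + 1))

def arfima_asymptote(d, n):
    """Large-n power-law form of a(n) from eq. 3."""
    return math.gamma(d + 1) * math.sin(math.pi * d) / math.pi * n ** (-(1 + d))

d = 0.3  # illustrative value; corresponds to alpha = 1 - 2d = 0.4
a = arfima_coefficients(d, 1000)
for n in (1, 10, 100, 1000):
    # exact coefficient next to its asymptotic approximation
    print(n, a[n - 1], arfima_asymptote(d, n))
```

The printed pairs should approach each other as n grows, which is the sense in which a(n) measures a slowly decaying relevance of past values.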



The purpose of writing out eq. 1 is to simply note that the coefficient $a(n)$ can be interpreted as
a measure of the relevance of $v_n$ in predicting $v_0$. We shall use this interpretation to motivate an
optimal way to represent the $v_n$'s in a memory buffer. If the buffer has unlimited storage resources,
then all $v_n$'s can be represented with perfect accuracy, with a unique buffer node for each $n$. At
each time step, the value in the $n$-th node of the buffer can be shifted to the $(n+1)$-th node and the
value $v_0$ can be filled into the first node. This memory buffer is a shift register of infinite size, and
it ensures a perfect representation of the past time series at each moment. Now the question we
address here is, if the buffer has only finite storage resources, is there a way to optimally sacrifice
the accuracy in order to represent long time scales? Since the relative importance of $v_n$ reduces with
increasing $n$ (as seen from eq. 2), we propose that the ideal buffer should store a weighted average
of $v_n$'s over monotonically increasing bin sizes such that the information overlap between successive
bins is a constant. Figure 1 represents this idea schematically. We motivate the construction of
the ideal buffer based on the principle that both (i) the error induced by averaging and (ii) the
information redundancy induced by averaging should be equally distributed over all scales that are
represented.

From fig. 1, note that in a shift register (SR) the value $v_n$ will be stored in, and only in, the
$n$-th node of the buffer. Let us now consider a smeared shift register (SSR) where $v_n$ is not only
stored in the $n$-th node of the buffer, but is smeared across the nodes around the $n$-th node. The
purpose of smearing is to acknowledge the existence of fluctuations inherent to the generation of
the time series. The utility of smearing is best explained in the context of a binary valued time
series corresponding to the occurrence of a stimulus ($v_n = 1$) or not ($v_n = 0$) at each time step $n$.
The stimulus represented by the series could be anything, like the occurrence of solar flares, or
the occurrence of lightning on a stormy night, or even the occurrence of an economic depression. Let us
say that the learner somehow makes an association that the occurrence of the stimulus at the
$m$-th time step in the past is a strong predictor of the current re-occurrence of the stimulus. If
the learner acquired such a statistic based on only a few learning instantiations, then it would be
advantageous for the learner to expect some fluctuations in the number $m$ to account for natural
fluctuations that exist in the generation of the time series. If the learner generalizes the learned
statistic to values around the $m$-th time step, then while forecasting the future the learner will
not just expect the re-occurrence of the stimulus exactly $m$ time steps in the future of a prior
occurrence, but will expect the re-occurrence in a spread-out fashion around the $m$-th time step in the
FIG. 1. A sample time series $\{v_n\}$ with power-law two-point correlation is plotted w.r.t. time ($n$). The
current time step is taken to be $n = 0$, and each $v_n$ represents the value at the $n$-th time step in the past.
The dotted curve shows $a(n)$ (see eq. 1), the relative importance of $v_n$ in predicting $v_0$. The upper x-axis
represents the shift register (SR) which stores each $v_n$ in a unique node. After each time step, the value
stored in each node is shifted to the next node on the left, and the value in the last node is ejected. The
middle x-axis represents a smeared shift register (SSR), where each node stores the weighted sum over a
bin of SR nodes. The lower x-axis represents the optimally fuzzy buffer which is essentially a collection of a
subset of SSR nodes. The bins surrounding each fuzzy buffer node indicate the range of SR nodes involved
in its construction. The size of the bins and the overlap between bins can be optimally chosen to reflect the
behavior of $a(n)$.



future. This can happen if the relevance of $v_m$ in predicting $v_0$ is shared by nodes surrounding
the $m$-th node, which in turn can be achieved by smearing or re-distributing the value $v_m$ into
nodes around the $m$-th node. Since the learner is unaware of the statistics underlying the time
series prior to learning it, neither the value of $m$ nor the spread in the fluctuations can be a priori
guessed by the learner. But given the ubiquity of scale free fluctuations in naturally occurring time
series, we argue that the best strategy for the learner is to represent the information from every
time step in the past in a smeared fashion and require the smearing to be evenly spread across all
timescales. Such a representation of the time series in the memory buffer is schematically shown
as the smeared shift register (SSR) in the middle x-axis of fig. 1.

As shown in fig. 1, the value in each SR node is distributed onto a set of neighboring SSR
nodes. In effect, each SSR node essentially encodes a weighted sum over a set of SR nodes, which
we shall refer to as a bin. To consider the effect of such smearing on the prediction of $v_0$, note
from eq. 1 that taking a weighted sum of $v_n$'s over all the nodes in a bin is equivalent to treating
$a(n)$ as a constant over the bin. If $\Delta_n$ is the size of the bin, then $\Delta_n\, da(n)/dn$ is a measure of
the error induced in the prediction due to smearing. To estimate the optimal size of the bins, we
use a simple guiding principle that the contribution of smearing to the error in prediction should
be proportional to its contribution to the prediction itself. That is,

$$ \Delta_n \frac{da(n)}{dn} \propto a(n) . \qquad (4) $$

This principle ensures, at least heuristically, that the smear-induced error is equally distributed
over all scales relevant to the prediction. Since $a(n)$ shows a power-law behavior for large $n$, this
principle yields $\Delta_n \propto n$.
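The last step can be spelled out by inserting the large-$n$ power law for $a(n)$ into the guiding principle. In the short worked calculation below, the constants $C$ (the prefactor of the power law) and $\lambda$ (the proportionality constant) are introduced only for this illustration.

```latex
a(n) \simeq C\, n^{-(1+d)}
\;\Longrightarrow\;
\frac{da(n)}{dn} \simeq -(1+d)\, C\, n^{-(2+d)} ,
\qquad
\Delta_n \left| \frac{da(n)}{dn} \right| = \lambda\, a(n)
\;\Longrightarrow\;
\Delta_n = \frac{\lambda}{1+d}\; n \;\propto\; n .
```

The bin size therefore grows linearly with the distance into the past, whatever the value of $d$ in the long range correlated regime.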
If we had unlimited storage resources, a unique SSR node could be assigned for each $n$,
representing the weighted sum over the bin of SR nodes centered around that $n$. But note that for
large $n$, the bins corresponding to successive SSR nodes will be highly overlapping and most of
the information represented by those nodes will be redundant. In a realistic situation with limited
resources allocated for the buffer, we could make a smarter choice by representing only a subset
of the SSR nodes in the buffer, thereby minimizing the information overlap between neighboring
nodes while representing longer time scales in the buffer. To optimally pick out the SSR nodes
that should be included in the buffer, we require the information overlap between neighboring
nodes at all scales to be a constant. This principle ensures that the information redundancy is
equally distributed over all scales relevant to the prediction. For motivational purposes, consider
the simplest case where the nodes are chosen such that the bins are non-overlapping and precisely
tile the entire timeline. In this case, there is zero information overlap between neighboring nodes.
Since we have argued in the previous paragraph that the size of the bin around the $n$-th SSR node
should be proportional to $n$, the maximum time scale that can be represented by the buffer will
grow exponentially with the total number of nodes in the buffer. In comparison to a shift
register where the number of nodes in the buffer is directly proportional to the longest time scale
represented, this fuzzy buffer can represent information from much longer timescales.

Although the fuzzy buffer represents the time series with fewer resources than the shift register,
it is non-trivial to construct the fuzzy buffer we have just described without explicitly having access
to a shift register storing the entire time series. A crucial feature of a memory buffer should be
self-sufficiency. That is, the information represented in the buffer should evolve at each time step
only from the incoming input and the already stored information in the buffer. However, in the
scheme described in figure 1, the SR is needed to construct the SSR, which is needed to construct
the fuzzy buffer. The information lost in weighted averaging and in discarding the intermediate SSR
nodes is essential in determining the information to be represented by the fuzzy buffer at the
subsequent time step. Hence such a construction of the fuzzy buffer is not self-sufficient to evolve
in time. It is hard to argue that a fuzzy buffer saves resources if one needs to have much more
extensive resources to construct it! In the next section, we describe a mathematically elegant
construction of a representation of temporal history based on encoding and inverting the Laplace
transformation of the time series. This method leads to a fuzzy buffer that is self-sufficient to
evolve in time without requiring a shift register for its construction. As such, it provides
an efficient way of storing a scale-invariant representation of history.



III. CONSTRUCTING A SCALE-FREE FUZZY BUFFER FROM A 
REPRESENTATION OF STIMULUS HISTORY 

In this section we describe a method for constructing the scale-free fuzzy buffer. We begin by
describing a mathematical model of psychological time, called TILT [13], developed to account for
findings from animal and human behavior. This model gives the mathematical basis for representing
the stimulus history in a scale-free fashion with properties of the smeared shift register that was
described in the previous section. We then describe several critical considerations necessary to
implement the mathematical model as a set of buffer nodes that leads to the optimally fuzzy
buffer.

Consider a real valued function $f(\tau)$ that generates the time series in real time $\tau$. Our aim now is to
construct a memory that represents $f(\tau)$ leading up to the present moment as activity distributed
over a set of nodes. The shift register is a simple solution to this problem. One could construct a
shift register from a set of nodes chained back to back such that at each time step the functional
value of $f$ is transmitted to the first node in the chain and the information from each node is
FIG. 2. The scale-free fuzzy representation. Each node in the $\mathbf{t}$ column is a leaky integrator with a specific
decay constant $s$ that is driven by the functional value $f$ at each moment. The activity of the $\mathbf{t}$ column is
transcribed at each moment by the operator $\mathbf{L}_k^{-1}$ to represent the past functional values in a scale-free fuzzy
fashion in the $\mathbf{T}$ column.



transmitted to the next node downstream. Under these circumstances, each node of the shift 
register will store the value of f from a specific moment in the past. Assuming that there is no 
error in transmitting from one node to the next and there are an infinite number of nodes, the shift 
register will accurately hold the entire history up to the present moment. 

We will now describe a more sophisticated method to represent history [14]. This method
results in a fuzzy estimate of $f(\tau)$ using two columns of nodes $\mathbf{t}$ and $\mathbf{T}$ as shown in fig. 2. The $\mathbf{T}$
column estimates $f(\tau)$ up to the present moment, while the $\mathbf{t}$ column is an intermediate step used
to construct $\mathbf{T}$. The nodes in the $\mathbf{t}$ column are leaky integrators with decay constants denoted by
$s$. Each leaky integrator independently gets activated by the value of $f$ at any instant and gradually
decays according to

$$ \frac{d\, \mathbf{t}(\tau, s)}{d\tau} = -s\, \mathbf{t}(\tau, s) + f(\tau) . \qquad (5) $$

At every instant, the information in the $\mathbf{t}$ column is transcribed into the $\mathbf{T}$ column through a
linear operator $\mathbf{L}_k^{-1}$:

$$ \mathbf{T}(\tau, \overset{*}{\tau}) = \frac{(-1)^k}{k!}\, s^{k+1}\, \mathbf{t}^{(k)}(\tau, s), \quad \text{where } s = -k/\overset{*}{\tau} . \qquad (6) $$

Here $k$ is any positive integer and $\mathbf{t}^{(k)}(\tau, s)$ is the $k$-th derivative of $\mathbf{t}(\tau, s)$ with respect to $s$. The
nodes of the $\mathbf{T}$ column are labeled by the parameter $\overset{*}{\tau}$ and are in one to one correspondence with
the nodes of the $\mathbf{t}$ column, which are labeled by $s$. The correspondence between $s$ and $\overset{*}{\tau}$ is given
by $s = -k/\overset{*}{\tau}$. We refer to $\overset{*}{\tau}$ as the internal time because it turns out that at any moment $\tau$, a $\overset{*}{\tau}$
node approximately represents the value of $f$ at a time $\tau + \overset{*}{\tau}$ in the past. The maximum value of $|\overset{*}{\tau}|$
FIG. 3. The function $f(\tau)$ is generated by a stimulus presented twice in the recent past. Taking the present
moment to be $\tau = 0$, the momentary activity distributed across the $\mathbf{T}$ column nodes is plotted.

can be made as large as needed at the cost of resources, but for mathematical idealization we can
take $\overset{*}{\tau}$ to $-\infty$.
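To make the role of the leaky integrators concrete, the sketch below advances a small bank of them in discrete time. It is an illustration under stated assumptions rather than the construction used in the paper: the input is assumed to be held constant within each time step (so the update is exact for that step), and the decay constants and function names are hypothetical choices made here.

```python
import math

def step_leaky_integrators(t, s_values, f_value, dt):
    """Advance each integrator, dt/dtau = -s*t + f (eq. 5), by one step of size dt.

    The update is exact provided f stays equal to f_value during the step.
    """
    new_t = []
    for t_s, s in zip(t, s_values):
        decay = math.exp(-s * dt)
        new_t.append(t_s * decay + f_value * (1.0 - decay) / s)
    return new_t

def run_integrators(signal, s_values, dt=1.0):
    """Feed a sampled signal through the bank, starting from zero activity."""
    t = [0.0] * len(s_values)
    for f_value in signal:
        t = step_leaky_integrators(t, s_values, f_value, dt)
    return t

s_values = [0.05, 0.1, 0.5, 1.0]    # illustrative decay constants
pulse = [1.0] + [0.0] * 20           # brief input followed by silence
print(run_integrators(pulse, s_values))
```

Nodes with small s retain a trace of the pulse for a long time while nodes with large s forget it quickly; it is this spread of decay rates that lets the column as a whole hold information about the entire history. Note also that each update uses only the current input and the stored activity, which is the self-sufficiency property asked for in section 2.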

The crucial mathematical inspiration of this approach is that $\mathbf{t}(\tau, s)$ encodes the Laplace
transform of the entire history of the function $f$ at any moment $\tau$, and the operator $\mathbf{L}_k^{-1}$ approximately
inverts the Laplace transform [18], with the approximation being almost perfect for large values
of $k$. When $k \to \infty$, $\mathbf{T}(\tau, \overset{*}{\tau})$ is a faithful representation of the history of $f$ from $\tau$ to $-\infty$, that is
$\mathbf{T}(\tau, \overset{*}{\tau}) \simeq f(\tau + \overset{*}{\tau})$ for all values of $\overset{*}{\tau}$ from 0 to $-\infty$. Hence when $k \to \infty$, $\mathbf{T}$ behaves exactly like
a shift register. Although the result of this computation is identical to a shift register, note that
the mechanism is completely different. Information is not transmitted directly from one node in the
$\mathbf{T}$ column to the next, as in a shift register. Rather, at each moment, the column of $\mathbf{t}$ nodes holds
information about the entire history of $f$ and the $\mathbf{T}$ column extracts this information to reconstruct
the history. This reconstruction is perfect when $k \to \infty$, but is only approximate when $k$ is finite.
It turns out that the error in reconstruction behaves exactly as we would expect in the smeared
shift register described in the previous section. For example, if the current moment is $\tau = 0$, then
the value of $f$ at a particular past moment $\tau_0$ is accurately represented by the node $\overset{*}{\tau} = \tau_0$ when
$k \to \infty$. But when $k$ is finite, the value of $f$ at the past moment $\tau_0$ is smeared over a range of $\overset{*}{\tau}$
nodes. It turns out that this smear is scale invariant and grows linearly with $|\tau_0|$.

To illustrate the behavior of $\mathbf{T}$ as a smeared shift register, consider the function $f(\tau)$ generated
by a stimulus that occurred twice in the recent past. With the present moment taken as $\tau = 0$,
figure 3 shows a function $f(\tau)$ that briefly takes non-zero values twice in the past (top), and the
estimate of $f(\tau)$ present in the $\mathbf{T}$ column (bottom). Note that there are two bumps in the $\mathbf{T}$ activity
at internal times that roughly match the stimulus presentation times. The most important feature
of this buffer is that the time of the more recent presentation of the stimulus is more accurately
represented than that of the earlier presentation. This can be seen from the fact that the peak
around $\overset{*}{\tau} = -7$ is taller and sharper than the peak around $\overset{*}{\tau} = -23$. Thus the value of $f$ from a
moment in the distant past is smeared over many more $\overset{*}{\tau}$ nodes than the value of $f$ from a more recent
past moment.

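Furthermore, it turns out that the smear is precisely scale invariant. To illustrate this, consider
$f(\tau)$ to be a Dirac delta function at a moment $\tau_0$ in the past, $f(\tau) = \delta(\tau - \tau_0)$, and let the present
moment be $\tau = 0$. Applying eqns. 5 and 6 we obtain

$$ \mathbf{T}(0, \overset{*}{\tau}) = \frac{1}{|\tau_0|}\, \frac{k^{k+1}}{k!} \left( \frac{\tau_0}{\overset{*}{\tau}} \right)^{k+1} e^{-k\, (\tau_0 / \overset{*}{\tau})} . \qquad (7) $$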
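For readers who want the intermediate steps, one way to arrive at this delta-function response is sketched below. It assumes the integrators start from zero activity in the distant past, so that eq. 5 integrates to a single decaying exponential after the input, and it uses $\tau_0 < 0$ so that $-1/\tau_0 = 1/|\tau_0|$.

```latex
\mathbf{t}(0, s) = \int_{-\infty}^{0} \delta(\tau' - \tau_0)\, e^{s \tau'}\, d\tau' = e^{s \tau_0} ,
\qquad
\mathbf{t}^{(k)}(0, s) = \tau_0^{\,k}\, e^{s \tau_0} ,
\\[6pt]
\mathbf{T}(0, \overset{*}{\tau})
= \left. \frac{(-1)^k}{k!}\, s^{k+1}\, \tau_0^{\,k}\, e^{s \tau_0} \right|_{s = -k/\overset{*}{\tau}}
= \frac{1}{|\tau_0|}\, \frac{k^{k+1}}{k!}
  \left( \frac{\tau_0}{\overset{*}{\tau}} \right)^{k+1}
  e^{-k\, (\tau_0 / \overset{*}{\tau})} .
```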



In the above equation both $\tau_0$ and $\overset{*}{\tau}$ are negative; $\mathbf{T}(0, \overset{*}{\tau})$ is the representation of the delta function
input distributed over the set of $\overset{*}{\tau}$ nodes in the $\mathbf{T}$ column. The delta function input in real time
is represented as a smooth peaked function in the $\mathbf{T}$ column such that the area underlying this
distribution over $\overset{*}{\tau}$ is always 1, reflecting the area underlying the input function. This can be
heuristically verified from fig. 3, where the areas under the two bumps are roughly the same.
The term $1/|\tau_0|$ on the r.h.s. of the above equation has the effect of reducing the height of the peak
in inverse proportion to $|\tau_0|$. The rest of the functional dependence is on the ratio $(\tau_0/\overset{*}{\tau})$, ensuring
that the shape of the distribution scales linearly with $\tau_0$. In this sense, $\mathbf{T}$ represents the history of $f(\tau)$
with a scale invariant smear. To quantify how much smear is introduced, we can estimate the
width of the peak as the standard deviation $\sigma$ of $\mathbf{T}(0, \overset{*}{\tau})$ from the above equation, which for $k > 2$
turns out to be



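$$ \sigma[\mathbf{T}(0, \overset{*}{\tau})] = \frac{|\tau_0|}{\sqrt{k-2}} \left[ \frac{k}{k-1} \right] . \qquad (8) $$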



The infinitely sharp delta function input to $f$ is thus smeared out in $\mathbf{T}$, and the width of the smear
increases linearly with the time of presentation of the input. Note that $k$ is the only free parameter
here and eq. 8 shows that $k$ has an inverse influence on the smear: the larger the $k$, the smaller the
smear, and the smaller the $k$, the larger the smear. Hence $k$ can simply be interpreted as the smear index. In
the limit $k \to \infty$, the smear vanishes and the delta function input propagates into the $\mathbf{T}$ column
exactly as a delta function without spreading, as expected in a shift register.
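The unit area and the linear growth of the smear can be checked numerically. The sketch below assumes the delta-function response has the form (1/|tau_0|) (k^(k+1)/k!) (tau_0/tau*)^(k+1) exp(-k tau_0/tau*), which is what eqns. 5 and 6 yield for a delta input; the integration range, grid size, and function names are arbitrary choices made for this illustration.

```python
import math

def delta_response(tau_star, tau_0, k):
    """Activity at internal time tau_star for a delta input at tau_0 (both negative)."""
    ratio = tau_0 / tau_star                                # positive number
    log_norm = (k + 1) * math.log(k) - math.lgamma(k + 1)   # log(k^(k+1) / k!)
    return math.exp(log_norm + (k + 1) * math.log(ratio) - k * ratio) / abs(tau_0)

def smear_statistics(tau_0, k, n_points=200000):
    """Trapezoid-rule area and standard deviation of the profile over tau_star."""
    lo, hi = 50.0 * tau_0, 0.02 * tau_0                     # lo < hi < 0
    h = (hi - lo) / n_points
    area = first = second = 0.0
    for i in range(n_points + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, n_points) else 1.0
        p = weight * h * delta_response(x, tau_0, k)
        area += p
        first += p * x
        second += p * x * x
    mean = first / area
    return area, math.sqrt(second / area - mean * mean)

k = 12
for tau_0 in (-5.0, -10.0, -20.0):
    area, sigma = smear_statistics(tau_0, k)
    predicted = abs(tau_0) / math.sqrt(k - 2) * k / (k - 1)  # width from eq. 8
    print(tau_0, area, sigma, predicted)
```

Doubling |tau_0| should double the numerically estimated width while leaving the area unchanged, which is the scale invariance discussed above.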

Finally, note that the linearity of eqns. 5 and 6 implies that a linear combination of different
functions $f$ will lead to a linear combination of representations in $\mathbf{T}$. For a more elaborate
mathematical description, refer to [13]. In the description so far, $\overset{*}{\tau}$ is a continuous variable. In order to
utilize these insights in machine learning applications, we need to construct a buffer with discrete
values of $\overset{*}{\tau}$. In the following subsections, we discuss several implementation details for constructing
a fuzzy buffer from $\mathbf{T}$.



A. Discretized Implementation 

Though mathematically convenient, it is not practical to represent all real values of $\overset{*}{\tau}$ in the $\mathbf{T}$
column: only discrete values of $\overset{*}{\tau}$ can be represented and there has to be a minimum $\overset{*}{\tau}_{min}$ and a
maximum $\overset{*}{\tau}_{max}$. One can in principle pick any set of $\overset{*}{\tau}$ values, and this will fix the set of $s$ values
in the $\mathbf{t}$ column because of the one to one correspondence $s = -k/\overset{*}{\tau}$.

The choice of the discrete set of nodes affects the way $\mathbf{T}$ is constructed from $\mathbf{t}$. Note from eq. 6
that the $\mathbf{L}_k^{-1}$ operator has to take the $k$-th derivative of $\mathbf{t}$ along the $s$ axis, which will be strongly
affected by the discretization of the $s$-axis. However, for any discretized set of $s$ values, we can
appropriately define a discretized derivative that is a linear operator. For notational convenience,
let us denote the activity at any moment $\mathbf{t}(\tau, s)$ as simply $\mathbf{t}(s)$. Since $\mathbf{t}$ is a column vector with
the rows labeled by $s$, we can construct a derivative matrix $[D]$ such that



$$ \mathbf{t}^{(1)} = [D]\, \mathbf{t} \;\Longrightarrow\; \mathbf{t}^{(k)} = [D]^k\, \mathbf{t} . \qquad (9) $$



The individual elements in the square matrix $[D]$ depend on the set of $s$ values. To compute
these elements, consider any three successive nodes with $s$ values $s_{-1}, s_0, s_1$. The discretized first
derivative of $\mathbf{t}$ at $s_0$ is given by

$$ \mathbf{t}^{(1)}(s_0) = \frac{\mathbf{t}(s_1) - \mathbf{t}(s_0)}{s_1 - s_0} \left[ \frac{s_0 - s_{-1}}{s_1 - s_{-1}} \right] + \frac{\mathbf{t}(s_0) - \mathbf{t}(s_{-1})}{s_0 - s_{-1}} \left[ \frac{s_1 - s_0}{s_1 - s_{-1}} \right] . \qquad (10) $$

The row in $[D]$ corresponding to $s_0$ will have non-zero entries only in the columns corresponding
to $s_{-1}$, $s_0$ and $s_1$. These three entries can be read out as the coefficients of $\mathbf{t}(s_{-1})$, $\mathbf{t}(s_0)$ and $\mathbf{t}(s_1)$
respectively in the r.h.s. of the above equation. Thus the entire matrix $[D]$ can be constructed from
any chosen set of $s$ values.

By taking the $k$-th power of $[D]$, the $\mathbf{L}_k^{-1}$ operator can be trivially constructed and the activity
of the $\mathbf{T}$ column with the chosen set of $\overset{*}{\tau}$ values can be calculated at each moment¹. We have now
established that the $\mathbf{T}$ column activity can be constructed self-consistently from the $\mathbf{t}$ column for
any set of $\overset{*}{\tau}$ values. From this point onwards, we shall not concern ourselves with the intermediate
stage $\mathbf{t}$ column and simply focus on the $\mathbf{T}$ column.
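The construction of the derivative matrix can be sketched in a few lines. The code below is an illustrative reading of the three-node rule described above (a weighted average of the forward and backward differences), not the authors' implementation. Leaving the first and last rows as zeros is an assumption made here for simplicity, and the grid and function names are hypothetical.

```python
def derivative_matrix(s):
    """Build [D] for a sorted, possibly non-uniform list of s values.

    Each interior row holds the three coefficients of the weighted
    forward/backward difference rule. The first and last rows are left
    as zeros (an assumption of this sketch), because the rule needs a
    neighbor on each side.
    """
    n = len(s)
    D = [[0.0] * n for _ in range(n)]
    for i in range(1, n - 1):
        below, here, above = s[i - 1], s[i], s[i + 1]
        w_fwd = (here - below) / (above - below)   # weight on forward difference
        w_bwd = (above - here) / (above - below)   # weight on backward difference
        D[i][i + 1] = w_fwd / (above - here)
        D[i][i - 1] = -w_bwd / (here - below)
        D[i][i] = -w_fwd / (above - here) + w_bwd / (here - below)
    return D

def matvec(D, t):
    """Matrix-vector product [D] t."""
    return [sum(d_ij * t_j for d_ij, t_j in zip(row, t)) for row in D]

def kth_derivative(D, t, k):
    """Apply [D] k times, i.e. compute [D]^k t."""
    for _ in range(k):
        t = matvec(D, t)
    return t

s = [0.1, 0.15, 0.25, 0.4, 0.6, 0.9, 1.3]   # illustrative non-uniform grid
t = [x * x for x in s]                       # test profile t(s) = s^2
D = derivative_matrix(s)
print(kth_derivative(D, t, 1))               # interior entries approximate 2s
print(kth_derivative(D, t, 2))               # entries away from the edges approximate 2
```

Each application of [D] contaminates one more node at either end of the grid, which is one way to see why extra nodes beyond the range of interest are needed when the k-th derivative is required.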



B. The optimal choice of nodes distribution 

We have shown that the entire history of $f$ at any moment is represented in the $\mathbf{T}$ column in a
smeared fashion when $k$ is finite. But the fact that a single delta function input is smeared over
many $\overset{*}{\tau}$ values implies that there is a lot of redundancy in information representation in the $\mathbf{T}$
column. This information redundancy can be reduced by optimally distributing the nodes along
the $\mathbf{T}$ column. Let $g(\overset{*}{\tau})$ represent the number density of nodes along the $\mathbf{T}$ column. If we number
the nodes in the $\mathbf{T}$ column by $N$, ranging from 1 to $N_{max}$, then $g(\overset{*}{\tau}) = dN/d\overset{*}{\tau}$. More simply, if
successive nodes are at $\overset{*}{\tau}$ and $\overset{*}{\tau} + \Delta$, then $g(\overset{*}{\tau}) = 1/\Delta$.

In order to construct a truly scale-free buffer, we first note that the redundancy in information
representation should be evenly spread over all time scales that are represented in the buffer. This
constraint can be formulated in terms of the mutual information shared by neighboring nodes in the
buffer. In the appendix, we mathematically show that in the presence of scale free input signals,
the mutual information shared by any two neighboring buffer nodes can be a constant only if
$g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$. This choice of $g(\overset{*}{\tau})$ is exactly equivalent to the constraint introduced by eq. 4 in
section 2.

Let us now observe a couple of interesting consequences of the choice $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$. First
note that this choice leads to the following arrangement of nodes in the $\mathbf{T}$ column:

$$ \overset{*}{\tau}_{min},\; \overset{*}{\tau}_{min}(1+c),\; \overset{*}{\tau}_{min}(1+c)^2,\; \ldots,\; \overset{*}{\tau}_{min}(1+c)^{(N_{max}-1)} = \overset{*}{\tau}_{max} , \qquad (11) $$

and the total number of nodes is

$$ N_{max} = 1 + \frac{\log(\overset{*}{\tau}_{max} / \overset{*}{\tau}_{min})}{\log(1+c)} . \qquad (12) $$



¹ To accurately construct the $k$-th derivative, we need $k$ extra nodes at the top and bottom of the $\mathbf{t}$ column. These
$s$ values are needed in addition to those determined from the one to one correspondence with the chosen $\overset{*}{\tau}$ values.
FIG. 4. Pattern of activity across the buffer nodes. The top row corresponds to $g(\overset{*}{\tau})$ = constant, and the
bottom row corresponds to $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$. The left panel plots the ordinal position $N$ of each buffer node
against its $\overset{*}{\tau}$ value. If a total of $N_{max}$ nodes in the buffer are to represent the range from $\overset{*}{\tau}_{min}$ to $\overset{*}{\tau}_{max}$, $N$
increases linearly with $\overset{*}{\tau}$ when $g(\overset{*}{\tau})$ = constant, while $N$ increases logarithmically when $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$. The $\mathbf{T}$
activity spread across the different nodes for a delta function input (as in eq. 7) is plotted in the right panel
for three different input times in the past. Note that the pattern of activity gets more smeared for larger
times when $g(\overset{*}{\tau})$ = constant, but the pattern stays the same with an overall translation and scale reduction
when $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$.



Here the constant $c$ controls the resolution and denotes the separation between neighboring nodes
around $|\overset{*}{\tau}| = 1$. In a shift register, the total number of nodes $N_{max}$ is proportional to the longest
time scale to be represented. But the $N_{max}$ in eq. 12 is related to the logarithm of the longest
timescale to be represented. This constitutes a tremendous saving of resources: in comparison to a
shift register, exponentially larger timescales can be represented in the fuzzy buffer with the same
amount of resources.
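The saving can be made tangible with a small count of nodes. The sketch below places nodes geometrically, working with magnitudes of the internal time for simplicity, and compares the node count with that of a uniformly spaced shift register covering the same range. The function names are hypothetical, and rounding the node count to an integer is an assumption of this illustration, so the last node only approximates the requested maximum when the range is not an exact power of 1 + c.

```python
import math

def fuzzy_buffer_nodes(tau_min, tau_max, c):
    """Magnitudes of geometrically spaced node positions between tau_min and tau_max."""
    n_max = 1 + round(math.log(tau_max / tau_min) / math.log(1 + c))
    return [tau_min * (1 + c) ** i for i in range(n_max)]

def shift_register_size(tau_min, tau_max):
    """Nodes needed by a shift register with uniform spacing tau_min."""
    return 1 + round((tau_max - tau_min) / tau_min)

for tau_max in (10.0, 100.0, 1000.0, 10000.0):
    fuzzy = len(fuzzy_buffer_nodes(1.0, tau_max, c=0.1))
    # longest timescale, fuzzy buffer nodes, shift register nodes
    print(tau_max, fuzzy, shift_register_size(1.0, tau_max))
```

Each tenfold increase in the longest timescale adds a roughly fixed number of fuzzy buffer nodes, while the shift register grows tenfold.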

Another interesting feature of the choice $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$ is that the pattern of activity across
the $N_{max}$ nodes is translationally invariant with time; in other words, the smear in the pattern of
activity does not increase with time since the input. Recall that the activity across $\overset{*}{\tau}$ values in
response to a delta function input at a time $\tau_0$ in the past is given by eq. 7, where the smear in
the pattern increased in proportion to $|\tau_0|$ (eq. 8). Figure 4 shows the activity across the $N_{max}$
nodes for three values of $\tau_0$. When $g(\overset{*}{\tau})$ is a constant, the distribution is more smeared for larger
$|\tau_0|$, but when $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$, the pattern simply gets translated with an overall scale reduction. It
is also clear that the pattern is more symmetric around the peak when $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$, while it is
asymmetric when $g(\overset{*}{\tau})$ is a constant. This can be heuristically understood by noting that though
the smear along the $\overset{*}{\tau}$ axis is proportional to the time since the presentation of the delta function,
the actual number of nodes representing that timescale is inversely proportional to the timescale
itself when $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$. Hence when the pattern of $\mathbf{T}$ activity is plotted w.r.t. the actual nodes
$N$ rather than $\overset{*}{\tau}$, the smear is effectively a constant.

It in fact turns out that $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$ is the only choice for which the activity pattern across the
nodes is translationally invariant. To see this, consider two different values of $\tau_o$, say $\tau_1$ and $\tau_2$, in
eq. 7, and let us denote the corresponding T activities as $T_1(0, \overset{*}{\tau})$ and $T_2(0, \overset{*}{\tau})$ respectively. If we
represent the $\overset{*}{\tau}$ value of the N-th node in the buffer by $\overset{*}{\tau}_N$, then the pattern of activity across the
nodes is translationally invariant if and only if $T_1(0, \overset{*}{\tau}_N) \propto T_2(0, \overset{*}{\tau}_{N+m})$ for some constant integer
m. For this to hold true, we need the quantity



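$$\frac{T_1(0,\overset{*}{\tau}_N)}{T_2(0,\overset{*}{\tau}_{N+m})} = \left(\frac{\tau_1}{\tau_2}\right)^{k} \left[\frac{\overset{*}{\tau}_{N+m}}{\overset{*}{\tau}_N}\right]^{k+1} \exp\left[-k\left(\frac{\tau_1}{\overset{*}{\tau}_N} - \frac{\tau_2}{\overset{*}{\tau}_{N+m}}\right)\right] \qquad (13)$$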



to be independent of N. This is possible only when the quantity inside the power law form and
the exponential form are separately independent of N. The power law form can be independent
of N only if $\overset{*}{\tau}_N \propto (1+c)^N$, which implies $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$ (see eq. 11). The exponential form is
generally dependent on N except when its argument is zero, which happens if $(1+c)^m = \tau_2/\tau_1$
for some integer m. When c is small compared to 1 and $\tau_1/\tau_2$ is not very close to 1, there will
always exist some integer m for which the equality approximately holds. Hence the pattern of
activity across the nodes can be translationally invariant only when $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$. Though this is
an interesting feature, we emphasize that the primary reason behind the choice of $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$ is
that it equally spreads information redundancy across all time scales, as shown in the appendix.
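The translational invariance can be checked numerically. The sketch below assumes that the delta-function response of eq. 7 has the form $(1/|\tau_o|)(k^{k+1}/k!)(\tau_o/\overset{*}{\tau})^{k+1} e^{-k\tau_o/\overset{*}{\tau}}$, written with positive magnitudes; the function name `activity` and all parameter values are hypothetical choices for illustration.

```python
import math

def activity(tau_o, tau_node, k):
    """Response of one node to a delta input tau_o in the past (assumed form of eq. 7)."""
    x = tau_o / tau_node
    norm = k ** (k + 1) / math.factorial(k)
    return norm / tau_o * x ** (k + 1) * math.exp(-k * x)

c, k, m = 0.1, 8, 5
nodes = [(1 + c) ** n for n in range(80)]      # geometric spacing, as in eq. 11
tau_1 = 3.0
tau_2 = tau_1 * (1 + c) ** m                   # chosen so that (1 + c)**m == tau_2 / tau_1

p1 = [activity(tau_1, t, k) for t in nodes]
p2 = [activity(tau_2, t, k) for t in nodes]

# Compare the pattern for tau_1 at node N with the pattern for tau_2 at node N + m.
ratios = [p1[n] / p2[n + m] for n in range(len(nodes) - m)]
print(min(ratios), max(ratios), (1 + c) ** m)
```

Under these assumptions the ratio is the same for every node index N, so the second pattern is the first one shifted by m nodes and scaled down by the factor $\tau_2/\tau_1$.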



C. Setting k to minimize information redundancy while avoiding information loss 

The choice of $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$ only ensures that the redundancy in information representation
introduced due to smearing is equally distributed over buffer nodes. But equal distribution of
information redundancy is not sufficient; we would also like to minimize information redundancy.
First note that the choice of $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$ does not completely specify the $\overset{*}{\tau}$ values of the buffer
nodes, because c remains a free parameter in eq. 11. For a given c, there will be high information
redundancy if k is too small, while there will be high information loss if k is too large. Heuristically,
if k is too small for a given c, then the $\overset{*}{\tau}$ values of neighboring nodes in the buffer will be sufficiently
close so that many nodes will have similar activities in response to an input from the past, resulting
in information redundancy. In contrast, if k is too large for a given c, the $\overset{*}{\tau}$ values of neighboring
nodes will be sufficiently distant so that the activities of all the nodes could be close to zero for
inputs from certain times in the past, resulting in information loss. So we need to appropriately
match c with k to balance and minimize the information redundancy and the information loss.

The basic idea here is that the information redundancy will be minimal when the information
from a single moment in the past is not spread over more than two neighboring nodes. To formalize
this, consider a delta function input at a time $\tau_o$ in the past and let the current moment be $\tau = 0$.
We shall now look at the activity induced by this input (eq. 7) in four successive buffer nodes,
N - 1, N, N + 1 and N + 2. The $\overset{*}{\tau}$ values of these nodes are given by eq. 11; for instance,
$\overset{*}{\tau}_N = \overset{*}{\tau}_{min}(1+c)^{N-1}$ and $\overset{*}{\tau}_{N+1} = \overset{*}{\tau}_{min}(1+c)^{N}$. From eq. 7, it can be seen that the N-th node
attains its maximum value when $\tau_o = \overset{*}{\tau}_N$ and the (N + 1)-th node attains its maximum value when
$\tau_o = \overset{*}{\tau}_{N+1}$, and for all the intervening values of $\tau_o$ between $\overset{*}{\tau}_N$ and $\overset{*}{\tau}_{N+1}$, the information about
the delta function input will be spread over both the N-th and the (N + 1)-th nodes. To minimize the
information redundancy, we simply require that when $\tau_o$ is in between $\overset{*}{\tau}_N$ and $\overset{*}{\tau}_{N+1}$, all the nodes
other than the N-th and the (N + 1)-th nodes should have almost zero activity.

FIG. 5. a. The activity of four successive nodes with $\overset{*}{\tau}$ values given by $\overset{*}{\tau}_{N-1}$, $\overset{*}{\tau}_N$, $\overset{*}{\tau}_{N+1}$, and $\overset{*}{\tau}_{N+2}$ in
response to a delta function input at a past moment $\tau_o = (\overset{*}{\tau}_N + \overset{*}{\tau}_{N+1})/2$. The nodes are chosen according
to eq. 11 with c = 1. b. The sum of activity of the nodes $\overset{*}{\tau}_N$ and $\overset{*}{\tau}_{N+1}$ in response
to a delta function input at various times $\tau_o$ ranging between $\overset{*}{\tau}_N$ and $\overset{*}{\tau}_{N+1}$. For each k, the activities are
normalized to have values in the range of 0 to 1.

Fig. 5a plots the activity for values of $\overset{*}{\tau}$ between $\overset{*}{\tau}_{N-1}$ and $\overset{*}{\tau}_{N+2}$, with c = 1, and when $\tau_o$
is exactly in the middle of $\overset{*}{\tau}_N$ and $\overset{*}{\tau}_{N+1}$. For each value of k, the activity is normalized so that
it lies between 0 and 1. The four vertical lines in fig. 5a represent the 4 nodes and the dots
represent the activity of the corresponding nodes. Observe that for k = 2 the activity of all 4
nodes is substantially different from zero, implying a significant information redundancy. At the
other extreme, the k = 100 case in fig. 5a shows that the activities of all the nodes are almost zero,
implying that the information about the delta function input at time $\tau_o = (\overset{*}{\tau}_N + \overset{*}{\tau}_{N+1})/2$ has been
lost. To minimize both the information loss and the information redundancy, the value of k should
be neither too large nor too small. Note that for the k = 12 case in fig. 5a, the activities of the
(N - 1)-th and the (N + 2)-th nodes are almost zero, but the activities of the N-th and (N + 1)-th nodes
are non-zero.

For any given value of c, a rough estimate of the appropriate k can be obtained by matching
the difference in the $\overset{*}{\tau}$ values of the neighboring nodes to the smear $\sigma$ (the standard deviation as
measured by eq. 8) in the distribution over the $\overset{*}{\tau}$ values.



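$$\frac{|\overset{*}{\tau}_{N+1}|}{\sqrt{k-2}} \left[\frac{k}{k-1}\right] \simeq \overset{*}{\tau}_{N+1} - \overset{*}{\tau}_N \quad\Rightarrow\quad \frac{k}{(k-1)\sqrt{k-2}} \simeq \frac{c}{1+c}. \qquad (14)$$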



This condition implies that a large value of k will be required when c is small and a small value of 
k will be required when c is large. In particular, note that k ~ 8 when c = 1. 
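The matching condition can be turned into a quick numerical estimate of k for a given c. The sketch below rests on an assumption that should be checked against eqs. 8 and 14: that the smear has the form $\sigma = |\tau_o|\,k / [(k-1)\sqrt{k-2}]$, so that the condition reads $k/[(k-1)\sqrt{k-2}] \simeq c/(1+c)$. The helper names are hypothetical, and picking the smallest integer k that satisfies the condition is one possible convention, not the authors' prescription.

```python
import math

def smear_to_spacing(k):
    """Left-hand side of the assumed matching condition: k / ((k - 1) * sqrt(k - 2))."""
    return k / ((k - 1) * math.sqrt(k - 2))

def estimate_k(c, k_max=10_000):
    """Smallest integer k >= 3 whose relative smear does not exceed c / (1 + c)."""
    target = c / (1 + c)
    for k in range(3, k_max + 1):
        if smear_to_spacing(k) <= target:
            return k
    raise ValueError("no k found below k_max")

for c in (1.0, 0.5, 0.1):
    print(c, estimate_k(c))
```

For c = 1 this convention returns k = 8, in line with the estimate quoted above, and smaller values of c call for larger k.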

We can construct a measure of information loss and use this as an additional constraint.
Figure 5b shows the sum of activity of the N-th and the (N + 1)-th nodes for all values of $\tau_o$ between
$\overset{*}{\tau}_N$ and $\overset{*}{\tau}_{N+1}$, with c = 1, for different values of k. For each k, the activities are normalized so
that the N-th node attains 1 when $\tau_o = \overset{*}{\tau}_N$. Now let us focus on the k = 100 case in fig. 5b.
There is a range of $\tau_o$ values for which the total activity of the two nodes is very close to zero.
The input is represented purely by the N-th node when $\tau_o$ is close to $\overset{*}{\tau}_N$, and is represented purely
by the (N + 1)-th node when $\tau_o$ is close to $\overset{*}{\tau}_{N+1}$, but at intermediate values of $\tau_o$ the input is not
represented by any node. To avoid such information loss, we shall require that the total activity
of the two nodes should not have a local minimum; in other words, the minimum should be at the
boundary, at $\tau_o = \overset{*}{\tau}_{N+1}$, as seen in fig. 5b for k = 4, 8 and 12. For c = 1, it turns out that
there exists a local minimum in the total activity of the two nodes only for values of k greater than
12. For any given c, the appropriate value of k that minimizes the information redundancy and
information loss can be estimated by examining a plot similar to fig. 5b with the requirement that
there be no local minimum.
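The no-local-minimum requirement can also be tested numerically instead of by inspecting a plot. The sketch below assumes the delta-function response of a node is proportional to $(1/|\overset{*}{\tau}|)(\tau_o/\overset{*}{\tau})^k e^{-k\tau_o/\overset{*}{\tau}}$, that is, the kernel of eq. A1 evaluated for a delta input; the function names and the grid resolution are hypothetical.

```python
import math

def pair_activity(x, k, c=1.0):
    """Summed activity of nodes N and N+1 for a delta input at tau_o = x * tau_N.

    Normalized so that node N alone reaches 1 at x = 1.
    """
    node_n = (x * math.exp(1 - x)) ** k
    y = x / (1 + c)                                  # tau_o / tau_{N+1}
    node_n1 = (y * math.exp(1 - y)) ** k / (1 + c)   # extra 1/|tau| factor of the kernel
    return node_n + node_n1

def has_interior_minimum(k, c=1.0, steps=2000):
    """True if the summed activity dips strictly inside tau_N < tau_o < tau_{N+1}."""
    xs = [1 + c * i / steps for i in range(steps + 1)]
    s = [pair_activity(x, k, c) for x in xs]
    return any(s[i] < s[i - 1] and s[i] < s[i + 1] for i in range(1, steps))

for k in (4, 8, 12, 100):
    print(k, has_interior_minimum(k))
```

A value of k is acceptable under this criterion when `has_interior_minimum` returns `False`; scanning k for a fixed c gives an estimate of the largest usable k, which can be compared with the threshold quoted in the text for c = 1.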

In summary, the optimally fuzzy buffer is the set of T column nodes with $\overset{*}{\tau}$ values given by
eq. 11, with the value of k appropriately matched with c to minimize information redundancy and
information loss.



IV. UTILITY OF THE FUZZY BUFFER IN TIME SERIES FORECASTING 

We have constructed a buffer that satisfies the motivation provided in section 2 — optimally 
sacrificing accuracy to accommodate scale-free fluctuations and enhance the capacity to represent 
information from very long time scales. Now, with a few simple illustrations we shall compare the 
performance of the fuzzy buffer to a shift register in time series forecasting. We consider three 
time series with different properties. 

The first was generated by fractionally integrating white noise [10] in a manner similar to that
described in section 2. The second and third time series were obtained from the online library
at http://datamarket.com. The second time series is the mean annual temperature of the Earth
from the year 1781 to 1988. The third time series is the monthly average number of sunspots from
the year 1749 to 1983, measured from Zurich, Switzerland. These three time series are plotted in
the top row of fig. 6. The corresponding two-point correlation function of each series is plotted
in the middle row of fig. 6. Examination of the two-point correlation functions reveals differences
between the series. The fractionally integrated noise series shows long-range correlations falling
off like a power law. The temperature series shows correlations near zero (but modestly positive)
over short ranges and weak negative correlation over longer times. The sunspots data has both
strong positive short-range autocorrelation and a longer-range negative correlation, balanced by a
periodicity of about 130 months corresponding to the 11-year solar cycle.

Our goal here is to illustrate the differences between a simple shift register and the fuzzy buffer.
Because our interest is in the effect of representing the time series in the memory buffer and not
in the sophistication of the learning algorithm, we use a simple linear regression algorithm to learn
and forecast these time series.



A. Learning and forecasting methodology 

Let $N_{max}$ denote the total number of nodes in the buffer and let N be an index corresponding
to each node, ranging from 1 to $N_{max}$. We shall denote the value contained in the buffer nodes at
any time step i by $B_i[N]$. The time series was sequentially fed into both the shift register and the
fuzzy buffer and the buffers were appropriately evolved at each time step. The values in the shift
register nodes were shifted downstream at each time step as discussed in section 2. At any instant the
shift register held information from exactly $N_{max}$ time steps in the past. The values in the fuzzy
buffer were evolved as described in section 3, with $\overset{*}{\tau}$ values taken to be $1, 2, 4, 8, 16, 32, \ldots, 2^{(N_{max}-1)}$,
conforming to eq. 11 with $\overset{*}{\tau}_{min} = 1$, c = 1 and k = 8.
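To make the two buffers concrete, here is a minimal sketch of what each one holds after seeing a series. It is not the update rule of section 3: the shift register is a plain bounded queue, and the fuzzy buffer is computed by directly summing the kernel of eq. A1 over the past on a unit time grid, which is an assumption adopted here for brevity. All function names are hypothetical.

```python
import math
from collections import deque

def fuzzy_weights(tau_node, k, horizon):
    """Weight given by one node to the input j steps in the past, j = 1..horizon."""
    norm = k ** (k + 1) / math.factorial(k)
    return [norm / tau_node * (j / tau_node) ** k * math.exp(-k * j / tau_node)
            for j in range(1, horizon + 1)]

def fuzzy_buffer(history, taus, k=8):
    """Node values for `history` (ordered oldest to newest), one value per node."""
    past = history[::-1]                       # past[0] is treated as one step back
    return [sum(w * v for w, v in zip(fuzzy_weights(t, k, len(past)), past))
            for t in taus]

def shift_register(history, n_nodes):
    """The last n_nodes values of `history`, most recent first."""
    reg = deque(maxlen=n_nodes)
    for v in history:
        reg.appendleft(v)                      # the oldest value falls off the far end
    return list(reg)

history = [0.0] * 30 + [1.0] + [0.0] * 9       # a single pulse, 10 steps in the past
print(shift_register(history, 4))
print(fuzzy_buffer(history, taus=[1, 2, 4, 8]))
```

In this example the pulse has already left a 4-node shift register, while the fuzzy-buffer nodes with the larger $\overset{*}{\tau}$ values still carry a smeared trace of it.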

FIG. 6. Time series forecasting. Top row: a. simulated time series with long range correlations based on
an ARFIMA model with d = 0.4, and white noise of standard deviation 0.01. b. time series of average annual
temperature of the Earth from the year 1781 to 1988. c. time series of monthly average number of sunspots
from the year 1749 to 1983. The middle row shows, directly below each time series, the two-point
correlations extracted from it. The bottom row shows the error in forecasting the
corresponding time series in the top row using the fuzzy buffer (red) and using the shift register (blue).

At each time step i, the value from each of the buffer nodes $B_i[N]$ was recorded along with
the value of the time series at that time step, denoted by $V_i$. We used a simple linear regression
algorithm to extract the intercept I and the regression coefficients $R_N$ so that the predicted value
of the time series at each time step $P_i$ and the squared error in prediction $E_i$ are

$$P_i = I + \sum_{N=1}^{N_{max}} R_N B_i[N], \qquad E_i = [P_i - V_i]^2. \qquad (15)$$

The regression coefficients were extracted by minimizing the total squared error $E = \sum_i E_i$. For
this purpose, we used the standard procedure lm() in the open source software R.

The accuracy of forecast is inversely related to the total squared error E. To get an absolute
measure of accuracy we have to factor out the intrinsic variability of the time series. In the bottom
row of fig. 6 we plot the mean of the squared error divided by the intrinsic variance in the time
series, $\mathrm{var}(V_i)$, for various sizes $N_{max}$ of the buffer. This quantity would range from 0 to 1; the
closer it is to zero, the more accurate the prediction.
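The regression of eq. 15 and the normalized error can be sketched without any statistics package. The example below is a stand-in for R's lm(), not a reproduction of the authors' script: it solves the normal equations by Gaussian elimination, and it uses a toy autoregressive series with a 3-node shift register as hypothetical data.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for j in range(col, n + 1):
                M[r][j] -= f * M[col][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def fit_linear(rows, targets):
    """Least-squares [I, R_1, ..., R_Nmax] minimizing the total squared error E."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    XtX = [[sum(x[a] * x[b] for x in X) for b in range(p)] for a in range(p)]
    Xty = [sum(x[a] * y for x, y in zip(X, targets)) for a in range(p)]
    return solve(XtX, Xty)

def normalized_error(rows, targets, coef):
    """Mean squared error divided by the variance of the series."""
    preds = [coef[0] + sum(r_n * v for r_n, v in zip(coef[1:], r)) for r in rows]
    mse = sum((p - y) ** 2 for p, y in zip(preds, targets)) / len(targets)
    mean = sum(targets) / len(targets)
    var = sum((y - mean) ** 2 for y in targets) / len(targets)
    return mse / var

# Hypothetical toy data: an autoregressive series and a 3-node shift register.
random.seed(0)
V = [0.0]
for _ in range(1999):
    V.append(0.9 * V[-1] + random.gauss(0.0, 0.1))
n_nodes = 3
rows = [[V[i - 1 - j] for j in range(n_nodes)] for i in range(n_nodes, len(V))]
targets = V[n_nodes:]

coef = fit_linear(rows, targets)
err = normalized_error(rows, targets, coef)
print(coef, err)
```

Replacing `rows` with the contents of a fuzzy buffer at each time step gives the corresponding fuzzy-buffer fit, so the two memory representations can be compared at equal $N_{max}$.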

a. Long range correlated series : The long range correlated series (fig. 6a) is by definition
constructed to yield a two-point correlation that decays as a power law. This is evident from its
two-point correlation in fig. 6d, which is decaying but always positive. Since the value of the series
at any time step is highly correlated with its value at the previous time step, we can expect to
generate a reasonable forecast using a single node buffer that holds the value from the previous
time step. This can be seen from fig. 6g, where the error in forecast is only 0.45 with a single buffer
node. Adding more buffer nodes reduces the error for both the shift register and the fuzzy buffer.
But for a given size of the buffer, the fuzzy buffer always has a lower error than the shift register.
This can be seen from fig. 6g, where the red curve (corresponding to the fuzzy buffer) is below the
blue curve (corresponding to the shift register).

Since this series is generated by fractionally integrating white noise, the mean squared error
cannot in principle be lower than the variance of the white noise used for construction. That is,
there is a lower bound for the error that can be achieved in fig. 6g. The dotted line in fig. 6g
indicates this bound. Note that the fuzzy buffer approaches this bound with a much smaller
number of nodes in the buffer than the shift register.

b. Temperature series : The temperature series (fig. 6b) is much noisier than the long
range correlated series, and appears structureless. This can be seen from the small values
of its two-point correlations in fig. 6e. This is also reflected in the fact that with a small number
of buffer nodes, the error is very high. Hence it can be concluded that no reliable short range
correlation exists in this series. That is, knowing the average temperature during a given year does
not help much in predicting the average temperature of the subsequent year. However, there seems
to be a weak negative correlation at longer scales that could be exploited in forecasting. Note from
fig. 6h that with additional nodes the fuzzy buffer performs better at forecasting and has a lower
error in forecasting than a shift register. This is because the fuzzy buffer can represent much longer
timescales than a shift register of equal size, and thereby exploit the long range correlations that
exist.

c. Sunspots series : The sunspot series (fig. 6c) is less noisy than the other two series
considered, and it has an oscillatory structure with a periodicity of about 130 months. It has high short range
correlations, and hence even a buffer with one node that holds the value from the previous time
step is sufficient to forecast with an error of only 0.16, as seen in fig. 6i. As before, with more
buffer nodes, the fuzzy buffer consistently has a lower error in forecasting than the shift register
with an equal number of nodes. Note that when the size of the buffer is increased from 4 to 8, the
shift register does not improve in accuracy while the fuzzy buffer continues to improve in accuracy.

The previous value in the time series provides most of the information required to predict the
next value in the series. With c = 1 (as taken here), the shift register and the fuzzy buffer are
precisely the same with only one node. Because most of the variance in the series can be captured
by the first node, the difference between the fuzzy buffer and the shift register with additional nodes
is not numerically overwhelming when viewed in fig. 6i. However, there is a qualitative difference
in the properties of the signal that have been extracted by the two buffers. In order to successfully
learn the 130 month periodicity, the information about high positive short range correlations is not
sufficient; it is essential to also learn the information about the negative correlations at longer time
scales. From fig. 6f, note that the negative correlations exist at a timescale of 50 to 100 months.
Hence in order to learn this information, these timescales have to be represented in the buffer. A
shift register with 8 nodes cannot represent these timescales but the fuzzy buffer with 8 nodes can.

FIG. 7. Forecasting the distant future. The sunspots time series of length 2820 is extrapolated for 500
time steps into the future using a. a shift register with 8 nodes, and b. the fuzzy buffer with 8 nodes.
The solid tick mark on the x-axis (at 2820) corresponds to the point where the original series ends and the
predicted future series begins.



To illustrate that it is possible to learn the periodicity using the fuzzy buffer, we forecast the
distant future values of the series. In figure 7 we extend the sunspots series by predicting it for a
future of 500 months. The regression coefficients $R_N$ and the intercept I are extracted from the
original series of length 2820. For the next 500 time steps, the predictions $P_i$ are treated as actual
values $V_i$, and are fed into the buffer to generate the prediction for the next step. Fig. 7a shows the
series generated by the shift register with 8 nodes. The solid tick mark on the x-axis at 2820 represents
the point at which the original series ends and the predicted future series begins. Note that the
series forecasted by the shift register immediately settles on the mean value without oscillation.
This is because the time scale at which the oscillations are manifest is not represented by the shift
register with 8 nodes. Fig. 7b shows the series generated by the fuzzy buffer with 8 nodes. Note
that the series predicted by the fuzzy buffer continues in an oscillating fashion with decreasing
amplitude for several cycles, eventually settling at the mean value. This is possible because the
fuzzy buffer represents the signal at a sufficiently long time scale to capture the negative correlations
in the two-point correlation function.
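The extrapolation procedure, in which each prediction is fed back into the buffer as if it had been observed, can be sketched as follows for a shift register. The function name and the coefficient values are hypothetical and chosen only to show the qualitative behaviour; they are not the coefficients fitted to the sunspots data.

```python
def extrapolate(history, intercept, coefs, steps):
    """Iterate one-step predictions, feeding each prediction back as if observed."""
    series = list(history)
    n = len(coefs)
    for _ in range(steps):
        window = series[-1 : -n - 1 : -1]        # most recent value first
        series.append(intercept + sum(r * v for r, v in zip(coefs, window)))
    return series[len(history):]

# One node with a positive coefficient: the forecast relaxes monotonically.
print(extrapolate([8.0], 0.0, [0.5], 3))

# Two nodes, the second with a negative coefficient: a damped oscillation.
future = extrapolate([0.0, 1.0], 0.0, [1.6, -0.8], 40)
print([round(v, 3) for v in future[:12]])
```

The second case illustrates the point made above: an oscillating forecast requires a negative weight on some sufficiently distant part of the past, so that timescale has to be present in the buffer.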

Of course, a shift register with many more nodes can capture the long-range correlations and
learn to predict the periodic oscillations in the signal. However, the number of nodes necessary
to describe the oscillatory nature of the signal needs to be of the order of the periodicity of the
oscillation, about 130 in this case. Though it might appear that a shift register with a sufficiently
large number of nodes suffices to extract all the relevant statistics, note that this would lead
to overfitting the data. At least in the case of the simple linear regression algorithm, the number of
regression coefficients to be extracted from the data increases with the number of buffer nodes, and
extracting a large number of regression coefficients from a finite data set will unquestionably lead
to overfitting the data. Hence it would be ideal to use the least number of buffer nodes required
to span the relevant time scale, as in the case of the fuzzy buffer.

V. DISCUSSION

We have demonstrated that in situations where long range correlations are relevant and when 
the storage resources are finite, the fuzzy buffer is superior to the shift register as a memory 
system. Over and beyond representing exponentially longer time scales than a shift register, the 
fuzzy buffer has another useful feature that has not been explicated in the previous section, namely 
generalization. When the time series to be learned is sufficiently long, it is reasonable to assume 
that the stochasticity in the underlying processes generating the series is statistically well sampled, 
and the statistics extracted by the learner could indeed correspond to the underlying processes. The 
accuracy of forecasting hence fundamentally relies on the length of the training series — the number 
of learning instantiations provided to the learner. However, in real life situations we can be forced 
to forecast based on very few learning experiences. When there is not sufficient data to extract the 
statistics of the underlying processes, it is very advantageous to use a fuzzy buffer rather than a 
shift register. This is because the smeared representation of history in the fuzzy buffer implicitly 
accounts for the scale free fluctuations in the natural external world. In section 2, we motivated 
the utility of smearing in the context of a binary valued time series denoting the occurrence of a 
stimulus. We shall now briefly expand on this to point out that a smeared representation of history 
in the buffer can speed up learning by facilitating generalization. 

Consider a situation in which a delta function stimulus at a time $\tau$ in the past is followed by
some relevant outcome. Let there be a distribution $p(\tau)$ of $\tau$ values for which the stimulus yields
the outcome. Suppose that a single value of $\tau$, say $\tau_o$, is chosen from the distribution and the
stimulus-outcome sequence is presented to the learner. If the time of occurrence of the stimulus
is represented in the learner's memory without any smear, as in a shift register, then observing
the single value $\tau_o$ does not allow the learner to generalize to other possible values of $\tau$. In a shift
register, the representation of an event 100 time steps in the past is categorically distinct from the
representation of an event 101 time steps in the past. While this property might be optimal in
laboratory conditions with a precisely timed stimulus-outcome sequence, we would expect that $p(\tau)$
has some intrinsic spread for naturally occurring events. While we cannot know the properties of
$p(\tau)$ from observing a single value $\tau_o$, we can commit to the scalar property based on the ubiquitous
scale-free fluctuations in the world. Information about the precise time of the event is smeared
out in T so that the learner cannot perfectly distinguish $\tau_o$ from neighboring values. Because the
learner cannot perfectly distinguish two events with similar values of $\tau$, the response to those two
events will also not be perfectly distinguishable. The scalar smear in T can hence be seen as an
attempt to generalize, that is, an attempt to learn the distribution $p(\tau)$ based on the single observed
value $\tau_o$.

In this paper, we have argued that a fuzzy buffer offers several advantages over a shift register
for representing time-varying information subject to capacity constraints. If this is in fact the
case, it seems natural to wonder if such a scale-free fuzzy representation of the past resembles the
memory of human and animal learners. After all, animals have evolved in the natural world where
predicting the imminent future is crucial for survival. Given the ubiquitous existence of scale-free
fluctuations in the natural world, it would have been evolutionarily adaptive for animals to
have developed a memory system that implicitly exploits the existence of such fluctuations. In
fact, TILT was developed to account for findings from experimental and cognitive psychology [14].
Numerous behavioral findings from learning and memory as well as timing tasks are consistent
with a scale-free representation of past events [19, 20]. In human memory studies, the forgetting
curve is usually observed to follow a power law function, which is of course scale-invariant [21, 22].
When humans are asked to reproduce or discriminate time intervals, they exhibit a characteristic
scale-invariance in the errors they produce [23, 24]. This is a characteristic feature not just of
humans, but of a wide variety of animal species like rats, rabbits and pigeons, as demonstrated
by classical conditioning experiments [25, 26]. These findings seem to suggest that humans and
animals might have a memory system that represents past events in a scale-free fuzzy fashion.

Regardless of whether the fuzzy buffer is a valid model of human memory, we propose that
it would be a very useful memory system for an artificial intelligent agent attempting to learn
the statistics in real world situations. To emphasize that the utility of the fuzzy buffer goes well
beyond the simplified illustrations in this paper, we note the following two points. (i) Invariably, in
real situations the agent will have to learn a multi-component (or vector valued) time series and
extract correlational or causal relationships across different components. Though in this paper we
have only focused on representing a single component time series in the buffer, it is easy to see that
each component of the time series can be represented in a separate fuzzy buffer and the learning
algorithm, whatever it may be, can act on these buffers to extract the relevant statistics. (ii) In
constructing the fuzzy buffer we simply aimed to optimally represent a stochastic time series
with power law two-point correlations (see eq. 1). However, the time series to be learned in real
life situations would generally have meaningful higher order correlations, especially if the series
generator involves non-Gaussian stochasticity. But as long as the series is optimally represented
in the memory buffer respecting the leading order statistics (the two-point correlations and
fluctuations), the learning algorithm should have a relatively easy task in extracting the higher
order correlations. Clearly, the simplified linear regression learning algorithm used in section 4
would have completely ignored any higher order correlations, and that may possibly have led to
large inaccuracies in forecasting the temperature series (see fig. 6h); a more sophisticated learning
algorithm could have forecasted much better.

In other words, the fuzzy buffer by itself is not a solution to machine learning and artificial
intelligence problems; one needs an efficient learning algorithm to act on the fuzzy buffer. On the
one hand there exist efficient machine learning algorithms like support vector machines [12, 27]
and deep belief networks [13] that learn the statistics by batch processing the entire time series
data, requiring the entire time series to be accurately accessible at once. On the other hand there
exist incremental learning algorithms in cognitive neuroscience, like the adaptive resonance theory
[28], that can learn on the fly, in an online fashion, without waiting for the entire time series to be
available, a feature mimicking human learning. However, there does not exist a learning algorithm
designed to work on a fuzzy buffer as described here, and constructing such a learning algorithm
would be very fruitful. In short, we propose the fuzzy buffer as a baseline memory representation
for statistical learning in general.



VI. CONCLUSION 



Signals with long-range temporal correlations are ubiquitous in the natural world. Such signals
present a distinct challenge to machine learners that rely on a shift-register representation of the
time series. Here we have described an efficient method for constructing a scale-free representation
of temporal history, T. The scale-free smear in the T representation helps the learner
to quickly generalize and to accommodate the inherent scale-free temporal fluctuations in the
natural world. The optimally fuzzy buffer is constructed by choosing the distribution of nodes
that minimizes the information redundancy and information loss, and equally distributes them to
all timescales. With a given number of nodes, the fuzzy buffer can represent information from
exponentially larger time scales when compared to a shift register. This representation of temporal
history may be an extremely useful way to represent time series with long-range correlations for
use in machine learning applications.

APPENDIX: INFORMATION REDUNDANCY ACROSS NODES

Here we quantify information redundancy by deriving explicit expressions for the mutual information
shared between neighboring buffer nodes. When the input signals are white noise or long-range
correlated signals with scale-free two-point correlations, we show that equally distributing
information redundancy to all scales requires $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$.

Information about $f(\tau)$ is distributed among all the T nodes in a smeared fashion. This leads
to redundancy in the information represented in these nodes. In order to more clearly understand
the ability of the T column to represent information regarding $f(\tau)$, we analyze the statistical
properties of the T column nodes when $f(\tau)$ is driven by a stochastic input. Taking the current
moment to be $\tau = 0$, the activity of a $\overset{*}{\tau}$ node in the T column is given by

$$T(0,\overset{*}{\tau}) = \int_{-\infty}^{0} \frac{1}{|\overset{*}{\tau}|}\,\frac{k^{k+1}}{k!} \left(\frac{\tau'}{\overset{*}{\tau}}\right)^{k} e^{-k\left(\tau'/\overset{*}{\tau}\right)} f(\tau')\, d\tau' \qquad (A1)$$

The expectation value of this node can be calculated by simply averaging over $f(\tau')$ inside the
integral, which should be a constant if it is generated by a stationary process. By defining $z = \tau'/\overset{*}{\tau}$,
we find that the expectation of T is proportional to the expectation of f.

$$\langle T(0,\overset{*}{\tau}) \rangle = \langle f \rangle\, \frac{k^{k+1}}{k!} \int_0^\infty z^k e^{-kz}\, dz \qquad (A2)$$
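The integral in eq. A2 can be evaluated in closed form. The short derivation below assumes that the kernel in eq. A1 carries the normalization factor $k^{k+1}/k!$; under that assumption the constant of proportionality between the two expectations is exactly one.

```latex
\int_0^\infty z^k e^{-kz}\,dz
  = \frac{1}{k^{k+1}} \int_0^\infty u^k e^{-u}\,du      % substitute u = kz
  = \frac{\Gamma(k+1)}{k^{k+1}}
  = \frac{k!}{k^{k+1}}
\qquad\Longrightarrow\qquad
\langle T(0,\overset{*}{\tau}) \rangle
  = \langle f \rangle\,\frac{k^{k+1}}{k!}\cdot\frac{k!}{k^{k+1}}
  = \langle f \rangle .
```

In other words, with this normalization every node reports the mean of the input regardless of its $\overset{*}{\tau}$ value, and only the fluctuations around the mean differ from node to node.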

To understand the information representation in terms of correlations among the nodes, we
calculate the correlations and mutual information among the T nodes when $f(\tau)$ is white noise and
long-range correlated noise.



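White-noise $f(\tau)$

Let $f(\tau)$ be white noise, that is, $\langle f \rangle = 0$ and $\langle f(\tau) f(\tau') \rangle \sim \delta(\tau - \tau')$. The variance in the
activity of each $\overset{*}{\tau}$ node is then given by

$$\langle T^2(0,\overset{*}{\tau}) \rangle = \int\!\!\int \frac{1}{|\overset{*}{\tau}|^2} \left(\frac{k^{k+1}}{k!}\right)^{2} \left(\frac{\tau'}{\overset{*}{\tau}}\right)^{k} \left(\frac{\tau''}{\overset{*}{\tau}}\right)^{k} e^{-k\left(\tau'/\overset{*}{\tau}\right)} e^{-k\left(\tau''/\overset{*}{\tau}\right)} \langle f(\tau') f(\tau'') \rangle\, d\tau'\, d\tau''$$

$$= \frac{1}{|\overset{*}{\tau}|} \left(\frac{k^{k+1}}{k!}\right)^{2} \int_0^\infty z^{2k} e^{-2kz}\, dz \qquad (A3)$$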

As expected, the variance of a large $|\overset{*}{\tau}|$ node is small because the activity in this node is constructed
by integrating the input function over a large timescale. This induces an artificial temporal
correlation that does not exist in the input function. To see this more clearly, we calculate the time
correlation in the activity of a single node, $\langle T(\tau, \overset{*}{\tau})\, T(\tau', \overset{*}{\tau}) \rangle$. With the definition $\delta = |\tau - \tau'|/|\overset{*}{\tau}|$,
it turns out that

$$\langle T(\tau,\overset{*}{\tau})\, T(\tau',\overset{*}{\tau}) \rangle = \frac{1}{|\overset{*}{\tau}|} \left(\frac{k^{k+1}}{k!}\right)^{2} e^{-k\delta} \int_0^\infty z^{k} (z+\delta)^{k} e^{-2kz}\, dz \qquad (A4)$$

Note that this correlation is nonzero for any $\delta > 0$, and it decays exponentially for large $\delta$. Hence
even a temporally uncorrelated white noise input leads to short range temporal correlations in a
$\overset{*}{\tau}$ node. It is important to emphasize here that such temporal correlations will not be introduced
in a shift register. This is because, in a shift register, the functional value of f at each moment is
just passed on to the downstream nodes in the chain without being integrated, and the temporal
autocorrelation in the activity of any node will simply reflect the temporal correlation in the input
function.
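The $1/|\overset{*}{\tau}|$ scaling of the node variance in eq. A3 can be checked on a discrete time grid. The sketch below assumes unit-variance white noise sampled at unit time steps and uses the kernel of eq. A1 as the node weights, so that the variance of a node is the sum of its squared weights; the function name and parameter values are hypothetical.

```python
import math

def node_variance(tau_node, k=8, horizon=5000):
    """Variance of one node driven by unit-variance white noise on a unit time grid."""
    norm = k ** (k + 1) / math.factorial(k)
    return sum(
        (norm / tau_node * (j / tau_node) ** k * math.exp(-k * j / tau_node)) ** 2
        for j in range(1, horizon + 1)
    )

for tau in (8, 16, 32, 64):
    v = node_variance(tau)
    print(tau, v, tau * v)      # tau * variance should stay roughly constant
```

Each doubling of $\overset{*}{\tau}$ halves the variance, which is the discrete counterpart of the $1/|\overset{*}{\tau}|$ prefactor in eq. A3.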

Let us now consider the instantaneous correlation in the activity of two different nodes. At any instant, the activity of two different nodes in a shift register will be uncorrelated in response to a white noise input. The different nodes in a shift register carry completely different information, making their mutual information zero. But in the T column, since the information is smeared across different $\overset{*}{\tau}$ nodes, the mutual information shared by different nodes is non-zero. The instantaneous correlation between two different nodes $\overset{*}{\tau}_1$ and $\overset{*}{\tau}_2$ can be calculated to be

$$\langle T(0,\overset{*}{\tau}_1)\,T(0,\overset{*}{\tau}_2) \rangle = \frac{1}{|\overset{*}{\tau}_1 \overset{*}{\tau}_2|} \left(\frac{k^{k+1}}{k!}\right)^{2} \int_{-\infty}^{0} \left(\frac{\tau'^{2}}{\overset{*}{\tau}_1 \overset{*}{\tau}_2}\right)^{k} e^{-k\tau'\left(1/\overset{*}{\tau}_1 + 1/\overset{*}{\tau}_2\right)}\,d\tau' = \frac{k\,(2k)!}{(k!)^{2}}\,\frac{|\overset{*}{\tau}_1 \overset{*}{\tau}_2|^{k}}{\left(|\overset{*}{\tau}_1| + |\overset{*}{\tau}_2|\right)^{2k+1}} \tag{A5}$$



The instantaneous correlation in the activity of the two nodes $\overset{*}{\tau}_1$ and $\overset{*}{\tau}_2$ is a measure of the mutual information represented by them. Factoring out the individual variances of the two nodes, we have the following measure for the mutual information.

$$I(\overset{*}{\tau}_1,\overset{*}{\tau}_2) = \frac{\langle T(0,\overset{*}{\tau}_1)\,T(0,\overset{*}{\tau}_2) \rangle}{\sqrt{\langle T^{2}(0,\overset{*}{\tau}_1) \rangle \langle T^{2}(0,\overset{*}{\tau}_2) \rangle}} = \left[\frac{2\sqrt{\overset{*}{\tau}_1/\overset{*}{\tau}_2}}{1 + \overset{*}{\tau}_1/\overset{*}{\tau}_2}\right]^{2k+1} \tag{A6}$$



This quantity is high when $\overset{*}{\tau}_1/\overset{*}{\tau}_2$ is close to 1. That is, the mutual information shared between neighboring nodes is the largest.
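
The dependence on the ratio of the two time constants can be checked numerically. The sketch below assumes the kernel of eq. A1 with prefactor $k^{k+1}/k!$ and a white-noise input, for which the normalized correlation of two nodes works out to $[2\sqrt{\beta}/(1+\beta)]^{2k+1}$ with $\beta = |\overset{*}{\tau}_1|/|\overset{*}{\tau}_2|$. This closed form is derived here from those assumptions rather than quoted from the text, and the function names are hypothetical.

```python
import math

def kernel(u, tau, k):
    """Weight a node with scale |tau| gives to the input u >= 0 time units in the past."""
    z = u / abs(tau)
    return k ** (k + 1) / math.factorial(k) * z ** k * math.exp(-k * z) / abs(tau)

def overlap_white(tau1, tau2, k):
    """Closed-form normalized correlation of two nodes under white noise."""
    beta = abs(tau1) / abs(tau2)
    return (2.0 * math.sqrt(beta) / (1.0 + beta)) ** (2 * k + 1)

def overlap_numeric(tau1, tau2, k, steps=20_000):
    """Same quantity from direct quadrature of the kernel products."""
    upper = 40.0 * max(abs(tau1), abs(tau2))  # kernels are negligible beyond this
    h = upper / steps
    s12 = s11 = s22 = 0.0
    for i in range(1, steps):  # endpoint contributions are negligible for k >= 1
        u = i * h
        a, b = kernel(u, tau1, k), kernel(u, tau2, k)
        s12 += a * b
        s11 += a * a
        s22 += b * b
    return s12 / math.sqrt(s11 * s22)  # the step size h cancels in the ratio

if __name__ == "__main__":
    k = 4
    for ratio in (1.0, 1.5, 2.0, 5.0):
        print(ratio, overlap_white(1.0, ratio, k), overlap_numeric(1.0, ratio, k))
```

The measure depends only on the ratio of the two time constants, so pairs such as (1, 2) and (10, 20) share the same value.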

The fact that the mutual information shared by neighboring nodes is non-vanishing implies that there is redundancy in the representation of the information in the set of nodes. Now, to formally apply the principle motivated in section 2, we require that the information redundancy should be equally distributed across all time scales represented by the buffer. Put another way, the redundancy in information representation should be scale-free. Mathematically this can be achieved by setting the mutual information between any two adjacent nodes to be a constant. If $\overset{*}{\tau}_1$ and $\overset{*}{\tau}_2$ are any two neighboring nodes, then in order for $I(\overset{*}{\tau}_1,\overset{*}{\tau}_2)$ to be a constant, $\overset{*}{\tau}_1/\overset{*}{\tau}_2$ should be a constant. This can happen only if the $\overset{*}{\tau}$ values of the nodes are arranged in a geometric progression, the form given earlier for the node spacing. In other words, this requirement implies that the density of nodes $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$.
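
A constant ratio between neighboring nodes and a node density falling as $1/|\overset{*}{\tau}|$ are two views of the same arrangement. The sketch below illustrates this with a hypothetical geometric spacing $|\overset{*}{\tau}_n| = \tau_{\min}(1+c)^n$; the parameter names `tau_min` and `c` and their values are assumptions made for the example, not taken from the text.

```python
def node_positions(tau_min, c, n_nodes):
    """Hypothetical geometric spacing: |tau*_n| = tau_min * (1 + c)**n."""
    return [tau_min * (1.0 + c) ** n for n in range(n_nodes)]

def count_in(taus, lo, hi):
    """Number of nodes with lo <= |tau*| < hi."""
    return sum(1 for t in taus if lo <= t < hi)

taus = node_positions(tau_min=1.0, c=0.1, n_nodes=100)

# Neighboring nodes always have the same ratio of time constants.
ratios = [b / a for a, b in zip(taus, taus[1:])]

# A density g ~ 1/|tau*| puts (nearly) the same number of nodes in each decade.
per_decade = [count_in(taus, 10.0 ** d, 10.0 ** (d + 1)) for d in range(3)]

if __name__ == "__main__":
    print(min(ratios), max(ratios))
    print(per_decade)
```

With this spacing, covering a range of time scales up to $\tau_{\max}$ needs a number of nodes that grows only as $\log(\tau_{\max}/\tau_{\min})$, whereas a shift register needs a number that grows linearly with $\tau_{\max}$.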



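Inputs with long range correlations

We will now show that choosing $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$ also equalizes the mutual information between all adjacent nodes even when the input has power-law correlations. Consider the input $f(\tau)$ such that

$$\langle f(\tau) f(\tau') \rangle \sim \frac{1}{|\tau - \tau'|^{\alpha}}$$

Reworking the calculations analogous to those leading to eq. A4, we find that the time correlation of a node with itself is

$$\langle T(\tau,\overset{*}{\tau})\,T(\tau',\overset{*}{\tau}) \rangle \propto |\overset{*}{\tau}|^{-\alpha} \sum_{r=0}^{k} C_r \int_{-\infty}^{\infty} \frac{|v|^{k-r} e^{-k|v|}}{|v + \delta|^{\alpha}}\,dv \tag{A7}$$

where $\delta = |\tau - \tau'|/|\overset{*}{\tau}|$ and $C_r = \dfrac{k!\,(k+r)!}{r!\,(k-r)!}\,\dfrac{2^{k-r}}{k^{k+r+1}}$. The value of $C_r$ is unimportant; we only need to note that it is a positive number.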
When $\alpha > 1$, the integral diverges at $v = -\delta$; however, we are only interested in the case $\alpha < 1$. When $\delta$ is very large, the entire contribution to the integral comes from the region $|v| \ll \delta$ and the denominator of the integrand can be approximated as $\delta^{\alpha}$. In effect,

$$\langle T(\tau,\overset{*}{\tau})\,T(\tau',\overset{*}{\tau}) \rangle \sim |\overset{*}{\tau}|^{-\alpha}\,\delta^{-\alpha} = |\tau - \tau'|^{-\alpha} \tag{A8}$$

for large $|\tau - \tau'|$. The temporal autocorrelation of the activity of any node should exactly reflect the temporal correlations in the input when $|\tau - \tau'|$ is much larger than the time scale of integration of that node ($|\overset{*}{\tau}|$). As a point of comparison, it is useful to note that any node in a shift register will also exactly reflect the correlations in the input.

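Let us now look at the instantaneous correlations across different nodes. In a shift register, the instantaneous correlation between two nodes $\overset{*}{\tau}_1$ and $\overset{*}{\tau}_2$ will simply be $|\overset{*}{\tau}_2 - \overset{*}{\tau}_1|^{-\alpha}$. The instantaneous correlation between two nodes in the T column turns out to be

$$\langle T(0,\overset{*}{\tau}_1)\,T(0,\overset{*}{\tau}_2) \rangle = |\overset{*}{\tau}_2|^{-\alpha} \sum_{r=0}^{k} X_r\,\frac{\beta^{k-r} + \beta^{k+1-\alpha}}{(1+\beta)^{2k-r+1}} \tag{A9}$$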

Here $\beta = |\overset{*}{\tau}_1|/|\overset{*}{\tau}_2|$ and each $X_r$ is a positive coefficient. By always choosing $|\overset{*}{\tau}_2| > |\overset{*}{\tau}_1|$, we note the two limiting cases of interest, when $\beta \ll 1$ and when $\beta \approx 1$.

When $\beta \ll 1$, the $r = k$ term in the summation of the above equation yields the leading term, and the correlation is simply proportional to $|\overset{*}{\tau}_2|^{-\alpha}$, which is approximately equal to $|\overset{*}{\tau}_2 - \overset{*}{\tau}_1|^{-\alpha}$. In this limit where $|\overset{*}{\tau}_2| \gg |\overset{*}{\tau}_1|$, the correlation between the two nodes behaves like the correlation between two shift register nodes. When $\beta \approx 1$, note from eq. A9 that the correlation will still be proportional to $|\overset{*}{\tau}_2|^{-\alpha}$. Now if $\overset{*}{\tau}_1$ and $\overset{*}{\tau}_2$ are neighboring nodes with close enough values, we can evaluate the mutual information between them to be

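$$I(\overset{*}{\tau}_1,\overset{*}{\tau}_2) = \frac{\langle T(0,\overset{*}{\tau}_1)\,T(0,\overset{*}{\tau}_2) \rangle}{\sqrt{\langle T^{2}(0,\overset{*}{\tau}_1) \rangle \langle T^{2}(0,\overset{*}{\tau}_2) \rangle}} \propto \left|\overset{*}{\tau}_2/\overset{*}{\tau}_1\right|^{-\alpha/2} \tag{A10}$$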

Reiterating our requirement from before that the mutual information shared by neighboring nodes at all scales should be the same, we are once again led to choose $\overset{*}{\tau}_2/\overset{*}{\tau}_1$ to be a constant or equivalently $g(\overset{*}{\tau}) \propto 1/|\overset{*}{\tau}|$.
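
The scale-free property for power-law inputs can also be checked numerically. The sketch below assumes the kernel of eq. A1 with prefactor $k^{k+1}/k!$ and an input with $\langle f(\tau) f(\tau') \rangle = |\tau - \tau'|^{-\alpha}$, $0 \le \alpha < 1$. The explicit coefficients `x_r` come from carrying out the double integral under those assumptions; they are one possible closed form, worked out for this illustration rather than quoted from the text, and the function names are hypothetical.

```python
import math

def corr_powerlaw(tau1, tau2, k, alpha):
    """Instantaneous correlation of two nodes for a power-law correlated input.

    Assumes <f(t) f(t')> = |t - t'|**(-alpha) with 0 <= alpha < 1.
    """
    t1, t2 = sorted((abs(tau1), abs(tau2)))  # enforce |tau2| >= |tau1|
    beta = t1 / t2
    total = 0.0
    for r in range(k + 1):
        # Positive coefficient obtained from the double integral (assumption).
        x_r = (math.comb(k, r) * math.factorial(2 * k - r)
               * math.gamma(r + 1 - alpha) * k ** alpha
               / math.factorial(k) ** 2)
        total += (x_r * (beta ** (k - r) + beta ** (k + 1 - alpha))
                  / (1.0 + beta) ** (2 * k - r + 1))
    return t2 ** (-alpha) * total

def overlap_powerlaw(tau1, tau2, k, alpha):
    """Correlation normalized by the two individual variances."""
    c12 = corr_powerlaw(tau1, tau2, k, alpha)
    c11 = corr_powerlaw(tau1, tau1, k, alpha)
    c22 = corr_powerlaw(tau2, tau2, k, alpha)
    return c12 / math.sqrt(c11 * c22)

if __name__ == "__main__":
    k, alpha = 4, 0.5
    # Pairs with the same ratio share the same normalized correlation.
    for t1, t2 in ((1.0, 1.1), (10.0, 11.0), (100.0, 110.0)):
        print(t1, t2, overlap_powerlaw(t1, t2, k, alpha))
```

Setting `alpha = 0` makes the input perfectly correlated, and the correlation then reduces to 1 for any pair of nodes, which serves as a sanity check on the coefficients.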